Scenario: Financial Advisory Services is one of the departments supported by the Kenya Analytics Team. They have reached out to have a Loan Book designed that highlights key client metrics. Task Notes:
The first task was to load the data and do a quick inspection of the first 5 rows.
import pandas as pd
from datetime import date
#Read in the dataset
dd = pd.read_csv("~/Downloads/challenges/data/analytics_specialist_hiring_dataset_final.csv")
#View the first 5 rows of data
dd.head(5)
| | contract_reference | status | start_date | end_date | next_contract_payment_due_date | cumulative_amount_paid | expected_cumulative_amount_paid | nominal_contract_value | deposit_amount | birthdate | gender | l3_entity_id | name | expected_payment_progression |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abc0001 | Completed | 2022-05-20T08:23:51.406303 | 2022-05-20T08:23:51.406303 | 2022-05-20T08:23:51.406303 | 0.0 | 0.00 | 0.00 | 0.0 | 1943-01-01 | 0 | 5121 | Cash Sales IPP | 1.000000 |
| 1 | abc0002 | Active | 2022-05-25T13:28:49.873746 | NaN | 2022-08-31T18:34:01.654276 | 5131.0 | 6164.16 | 8171.86 | 2000.0 | 1937-01-01 | 1 | 5073 | Individual loan | 0.754315 |
| 2 | abc0003 | Active | 2022-05-31T10:02:23.158972 | NaN | 2022-07-22T01:17:15.526819 | 3100.0 | 5273.52 | 6851.86 | 2000.0 | 1963-11-10 | 1 | 5072 | Individual loan | 0.769648 |
| 3 | abc0004 | Active | 2022-07-05T11:49:03.801563 | NaN | 2022-08-18T13:34:13.983433 | 4350.0 | 6142.35 | 10451.86 | 3000.0 | NaN | 1 | 5084 | Individual loan | 0.587680 |
| 4 | abc0005 | Active | 2022-05-31T06:31:25.977374 | NaN | 2022-07-04T13:30:36.436593 | 3450.0 | 7743.68 | 10271.86 | 2500.0 | 1998-05-10 | 1 | 5074 | Individual loan | 0.753873 |
Assuming the Finance request came in today (the day you receive the exercise), assign each client a PAR Status based on their repayment progression. This should be added as a new column labelled ‘PAR status’ in the dataset.
#Format the next_contract_payment_due_date column
dd["next_contract_payment_due_date"] = pd.to_datetime(dd["next_contract_payment_due_date"])
#Get today's date
today = date.today()
#Add a column for today's date
dd['today'] = pd.to_datetime(today)
#Calculate the number of days before or after the next contract payment due.
#Label the column as difference
dd['difference'] = (dd['today'] - dd['next_contract_payment_due_date']).dt.days
#Format date inputs to look presentable
dd[["start_date", "end_date"]] = dd[["start_date", "end_date"]].apply(pd.to_datetime)
dd['start_date'] = dd['start_date'].dt.strftime('%d/%m/%y')
dd['end_date'] = dd['end_date'].dt.strftime('%d/%m/%y')
dd['next_contract_payment_due_date'] = dd['next_contract_payment_due_date'].dt.strftime('%d/%m/%y')
dd['today'] = dd['today'].dt.strftime('%d/%m/%y')
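One caveat with the presentation step above: `dt.strftime` returns plain strings (dtype `object`), not datetimes, so any date arithmetic has to happen before formatting, as done with the `difference` column. A minimal sketch:

```python
import pandas as pd

# strftime converts the datetime column to plain strings (dtype object),
# so further date arithmetic must happen before this step.
s = pd.Series(pd.to_datetime(["2022-05-20"]))
formatted = s.dt.strftime('%d/%m/%y')
print(formatted.iloc[0], formatted.dtype)
```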
#Create a function to assign PAR status based on the definition provided
def par(difference_in_days):
    if difference_in_days < 0:
        return "On Time"
    elif 0 <= difference_in_days <= 7:
        return "PAR0-7"
    elif 8 <= difference_in_days <= 30:
        return "PAR8-30"
    elif 31 <= difference_in_days <= 90:
        return "PAR31-90"
    elif difference_in_days >= 91:
        return "PAR90+"
#Apply the function
dd['PAR'] = dd['difference'].apply(par)
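As a design note, the same banding can be done without a row-wise `apply` by using `pd.cut`. A minimal sketch, with made-up day differences (the values below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical day-difference values for illustration only.
diff = pd.Series([-3, 0, 5, 12, 45, 120])

# Same PAR bands as the par() function: (-inf, 0) -> On Time,
# [0, 7] -> PAR0-7, [8, 30] -> PAR8-30, [31, 90] -> PAR31-90, [91, inf) -> PAR90+.
# Bin edges are right-inclusive, so -1 closes the "On Time" band for integer days.
bins = [-np.inf, -1, 7, 30, 90, np.inf]
labels = ["On Time", "PAR0-7", "PAR8-30", "PAR31-90", "PAR90+"]
par_band = pd.cut(diff, bins=bins, labels=labels)
print(list(par_band))
```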
#Subset the data for proper viewing and quick inspection.
par_dataset = dd[["contract_reference", "next_contract_payment_due_date", "today", "difference", "PAR"]]
par_dataset.head(5)
| | contract_reference | next_contract_payment_due_date | today | difference | PAR |
|---|---|---|---|---|---|
| 0 | abc0001 | 20/05/22 | 25/10/22 | 157 | PAR90+ |
| 1 | abc0002 | 31/08/22 | 25/10/22 | 54 | PAR31-90 |
| 2 | abc0003 | 22/07/22 | 25/10/22 | 94 | PAR90+ |
| 3 | abc0004 | 18/08/22 | 25/10/22 | 67 | PAR31-90 |
| 4 | abc0005 | 04/07/22 | 25/10/22 | 112 | PAR90+ |
The current collection rate is derived by taking the Cumulative Amount Paid divided by the Expected Cumulative Amount Paid.
#Calculate Current collection rate
dd["current_collection_rate"] = dd["cumulative_amount_paid"]/dd["expected_cumulative_amount_paid"]
#Subset the data for proper viewing and quick inspection.
current_collection_rate = dd[["contract_reference", "cumulative_amount_paid", "expected_cumulative_amount_paid", "current_collection_rate"]]
current_collection_rate.head(5)
| | contract_reference | cumulative_amount_paid | expected_cumulative_amount_paid | current_collection_rate |
|---|---|---|---|---|
| 0 | abc0001 | 0.0 | 0.00 | NaN |
| 1 | abc0002 | 5131.0 | 6164.16 | 0.832392 |
| 2 | abc0003 | 3100.0 | 5273.52 | 0.587843 |
| 3 | abc0004 | 4350.0 | 6142.35 | 0.708198 |
| 4 | abc0005 | 3450.0 | 7743.68 | 0.445525 |
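The NaN in the first row comes from dividing 0 by 0. If the expected amount could be 0 while something has been paid, the division would give `inf` instead; a hedged sketch (with hypothetical values) that maps both cases to NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical paid/expected values for illustration only.
paid = pd.Series([0.0, 5131.0, 100.0])
expected = pd.Series([0.0, 6164.16, 0.0])

# 0/0 gives NaN; x/0 with x > 0 gives inf, which is also mapped to NaN here.
rate = (paid / expected).replace([np.inf, -np.inf], np.nan)
print(rate.isna().tolist())
```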
This is the cumulative amount expected to have been paid by now minus what has actually been paid.
#Calculate total amount in arrears
dd["total_arrears"] = dd["expected_cumulative_amount_paid"]-dd["cumulative_amount_paid"]
#Subset the data for proper viewing and quick inspection.
total_arrears = dd[["contract_reference", "expected_cumulative_amount_paid", "cumulative_amount_paid", "total_arrears"]]
total_arrears.head(5)
| | contract_reference | expected_cumulative_amount_paid | cumulative_amount_paid | total_arrears |
|---|---|---|---|---|
| 0 | abc0001 | 0.00 | 0.0 | 0.00 |
| 1 | abc0002 | 6164.16 | 5131.0 | 1033.16 |
| 2 | abc0003 | 5273.52 | 3100.0 | 2173.52 |
| 3 | abc0004 | 6142.35 | 4350.0 | 1792.35 |
| 4 | abc0005 | 7743.68 | 3450.0 | 4293.68 |
This is the cumulative amount paid divided by the nominal contract value.
#Calculate total Payment Progression
dd["payment_progression"] = dd["cumulative_amount_paid"]/dd["nominal_contract_value"]
#Subset the data for proper viewing and quick inspection.
payment_progression = dd[["contract_reference", "cumulative_amount_paid", "nominal_contract_value", "payment_progression"]]
payment_progression.head(5)
| | contract_reference | cumulative_amount_paid | nominal_contract_value | payment_progression |
|---|---|---|---|---|
| 0 | abc0001 | 0.0 | 0.00 | NaN |
| 1 | abc0002 | 5131.0 | 8171.86 | 0.627886 |
| 2 | abc0003 | 3100.0 | 6851.86 | 0.452432 |
| 3 | abc0004 | 4350.0 | 10451.86 | 0.416194 |
| 4 | abc0005 | 3450.0 | 10271.86 | 0.335869 |
This is the expected cumulative amount paid divided by the nominal contract value.
#The Expected Payment progression
dd["expected_payment_progression"] = dd["expected_cumulative_amount_paid"]/dd["nominal_contract_value"]
#Subset the data for proper viewing and quick inspection.
expected_payment_progression = dd[["contract_reference", "expected_cumulative_amount_paid", "nominal_contract_value", "expected_payment_progression"]]
expected_payment_progression.head(5)
| | contract_reference | expected_cumulative_amount_paid | nominal_contract_value | expected_payment_progression |
|---|---|---|---|---|
| 0 | abc0001 | 0.00 | 0.00 | NaN |
| 1 | abc0002 | 6164.16 | 8171.86 | 0.754315 |
| 2 | abc0003 | 5273.52 | 6851.86 | 0.769648 |
| 3 | abc0004 | 6142.35 | 10451.86 | 0.587680 |
| 4 | abc0005 | 7743.68 | 10271.86 | 0.753873 |
Any entry in the name column that contains ‘Individual’ is an Individual Loan, any entry that contains ‘Group’ is a Group Loan, any entry that contains ‘Paygo’ is a Paygo Loan and any entry that contains ‘Cash’ is a Cash Sale.
#Write a function to derive the loan type from the product name
def loan_type(name):
    #Match case-insensitively so variants like 'Paygo'/'PayGo' classify the same way
    n = name.lower()
    if 'individual' in n:
        return "Individual Loan"
    elif 'group' in n:
        return "Group Loan"
    elif 'paygo' in n:
        return "Paygo Loan"
    elif 'cash' in n:
        return "Cash Sale"
    return ""
#Apply the function
dd['loan_type'] = dd['name'].apply(loan_type)
#Subset the data for proper viewing and quick inspection.
loan_types = dd[["contract_reference", "name", "loan_type"]]
loan_types.head(5)
| | contract_reference | name | loan_type |
|---|---|---|---|
| 0 | abc0001 | Cash Sales IPP | Cash Sale |
| 1 | abc0002 | Individual loan | Individual Loan |
| 2 | abc0003 | Individual loan | Individual Loan |
| 3 | abc0004 | Individual loan | Individual Loan |
| 4 | abc0005 | Individual loan | Individual Loan |
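The same keyword classification can be vectorised with `numpy.select`; a minimal sketch with illustrative product names ('Paygo plan' is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical product names for illustration ('Paygo plan' is made up).
names = pd.Series(["Cash Sales IPP", "Individual loan", "Group loan", "Paygo plan"])

# Same keyword rules as loan_type(); the first matching condition wins.
conditions = [
    names.str.contains("individual", case=False),
    names.str.contains("group", case=False),
    names.str.contains("paygo", case=False),
    names.str.contains("cash", case=False),
]
choices = ["Individual Loan", "Group Loan", "Paygo Loan", "Cash Sale"]
result = np.select(conditions, choices, default="")
print(list(result))
```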
Write an SQL query that assigns clients PAR Statuses based on repayment, similar to the Python task above. Note: this is just for the PAR Status and not all of the above-derived metrics.
I created a PostgreSQL database instance on AWS (Free Tier) so that I could verify that the queries I was writing worked. The database also helps me present the solution better.
I then used SQLAlchemy to connect the pandas API to the database.
#Connecting to my database.
#These are disposable Free Tier credentials, so there is no risk in sharing them.
import sqlalchemy as sa
#Set Up my credentials
engine = sa.create_engine('postgresql://kevegnjhongrsq:7a9c90ae154a234452ef3ff25d2968512d9155aced9ac757cf20adc34fdf09b7@ec2-54-243-226-219.compute-1.amazonaws.com:5432/dd7lo89ujeuuiu')
#Query my loanbook table
#I have updated my database table to include only the columns I intend to use.
df = pd.read_sql_query('select * from loanbook limit 5;',con=engine)
df
| | contract_reference | status | next_contract_payment_due_date | today | date_difference |
|---|---|---|---|---|---|
| 0 | abc0044 | Active | 2022-06-19 09:21:05.925129 | 2022-10-25 09:31:07.209581 | 128 |
| 1 | abc0045 | Active | 2022-08-26 02:16:24.213503 | 2022-10-25 09:31:07.209581 | 60 |
| 2 | abc0046 | Active | 2022-10-25 06:23:21.975138 | 2022-10-25 09:31:07.209581 | 0 |
| 3 | abc0047 | Active | 2022-10-22 09:10:06.427091 | 2022-10-25 09:31:07.209581 | 3 |
| 4 | abc0048 | Active | 2022-10-24 04:30:48.873412 | 2022-10-25 09:31:07.209581 | 1 |
The next step was to alter the table and add a column with a timestamp of today's date.
/*Add today column */
ALTER TABLE loanbook ADD COLUMN today TIMESTAMP DEFAULT NOW();
#Query my loanbook table to check the results of the above query.
df = pd.read_sql_query('select contract_reference, status,next_contract_payment_due_date,today from loanbook limit 5;',con=engine)
df
| | contract_reference | status | next_contract_payment_due_date | today |
|---|---|---|---|---|
| 0 | abc0044 | Active | 2022-06-19 09:21:05.925129 | 2022-10-25 09:31:07.209581 |
| 1 | abc0045 | Active | 2022-08-26 02:16:24.213503 | 2022-10-25 09:31:07.209581 |
| 2 | abc0046 | Active | 2022-10-25 06:23:21.975138 | 2022-10-25 09:31:07.209581 |
| 3 | abc0047 | Active | 2022-10-22 09:10:06.427091 | 2022-10-25 09:31:07.209581 |
| 4 | abc0048 | Active | 2022-10-24 04:30:48.873412 | 2022-10-25 09:31:07.209581 |
This was achieved by first adding a date_difference column to the loanbook table, using the command below:
/* Difference in days between next contract payment_due date */
ALTER TABLE loanbook ADD date_difference INTEGER;
The next step was to compute the actual date difference.
/* Difference in days between next contract payment_due date */
UPDATE loanbook SET date_difference = extract(day from today - next_contract_payment_due_date);
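As a sanity check on the SQL above: in Postgres, subtracting two timestamps yields an interval whose `day` field is the whole-day difference, which matches pandas' `Timedelta.days`. A quick cross-check using the first row's values:

```python
import pandas as pd

# Cross-check the day arithmetic from the SQL using row abc0044's values
# (microseconds truncated for readability).
due = pd.Timestamp("2022-06-19 09:21:05")
today = pd.Timestamp("2022-10-25 09:31:07")
print((today - due).days)
```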
#Query my loanbook table to check the results of the above query.
df = pd.read_sql_query('select contract_reference, status,next_contract_payment_due_date,today,date_difference from loanbook limit 5;',con=engine)
df
| | contract_reference | status | next_contract_payment_due_date | today | date_difference |
|---|---|---|---|---|---|
| 0 | abc0044 | Active | 2022-06-19 09:21:05.925129 | 2022-10-25 09:31:07.209581 | 128 |
| 1 | abc0045 | Active | 2022-08-26 02:16:24.213503 | 2022-10-25 09:31:07.209581 | 60 |
| 2 | abc0046 | Active | 2022-10-25 06:23:21.975138 | 2022-10-25 09:31:07.209581 | 0 |
| 3 | abc0047 | Active | 2022-10-22 09:10:06.427091 | 2022-10-25 09:31:07.209581 | 3 |
| 4 | abc0048 | Active | 2022-10-24 04:30:48.873412 | 2022-10-25 09:31:07.209581 | 1 |
Finally, I used SELECT with a CASE expression to compute PAR.
/* Calculate PAR */
select contract_reference,status, loanbook.date_difference,
(case when date_difference < 0 then 'On Time'
when date_difference >= 0 and date_difference <= 7 then 'PAR0-7'
when date_difference >= 8 and date_difference <= 30 then 'PAR8-30'
when date_difference >= 31 and date_difference <= 90 then 'PAR31-90'
when date_difference >= 91 then 'PAR90+'
else 'something else'
end) as PAR FROM loanbook;
#Final query to calculate PAR.
df = pd.read_sql_query(""" select contract_reference,status, loanbook.date_difference,
(case when date_difference < 0 then 'On Time'
when date_difference >= 0 and date_difference <= 7 then 'PAR0-7'
when date_difference >= 8 and date_difference <= 30 then 'PAR8-30'
when date_difference >= 31 and date_difference <= 90 then 'PAR31-90'
when date_difference >= 91 then 'PAR90+'
else 'something else'
end) as PAR FROM loanbook;
""",con=engine)
df.head(5)
| | contract_reference | status | date_difference | par |
|---|---|---|---|---|
| 0 | abc0044 | Active | 128 | PAR90+ |
| 1 | abc0045 | Active | 60 | PAR31-90 |
| 2 | abc0046 | Active | 0 | PAR0-7 |
| 3 | abc0047 | Active | 3 | PAR0-7 |
| 4 | abc0048 | Active | 1 | PAR0-7 |
This was determined by counting loans of each type. Group loans were the most popular (56%), followed by individual loans (35%).
Kindly note the chart is interactive; you can download a PNG file of the chart to embed in a report.
#Plot Pie chart - Plotly
import plotly.express as px
dd3 = dd.groupby(['loan_type']).size().reset_index(name = 'counts')
dd3 = dd3.sort_values(by = ["counts"],ascending=False)
#Draw plotly
fig =px.pie(dd3, values='counts', names='loan_type',hole=.3)
fig.show()
Group loans were the most common among active loans, followed by individual loans. Group loans also led in completed loans, closely followed by cash sales.
#Popular loans loan type and status
dd2 = dd.groupby(['status', 'loan_type']).size().reset_index(name = 'counts')
#View the raw data
dd2
| | status | loan_type | counts |
|---|---|---|---|
| 0 | Active | Group Loan | 462 |
| 1 | Active | Individual Loan | 292 |
| 2 | Completed | Cash Sale | 84 |
| 3 | Completed | Group Loan | 100 |
| 4 | Completed | Individual Loan | 60 |
| 5 | Completed | Paygo Loan | 1 |
| 6 | Defaulted | Group Loan | 1 |
Plotting the most popular loans per status using a grouped bar chart.
Kindly note the chart is interactive; you can download a PNG file of the chart to embed in a report.
import plotly.graph_objects as go
#Build one bar trace per loan type, taking x from each subset's own statuses
#so counts land under the right status (not every loan type has every status).
fig = go.Figure()
for lt in ["Group Loan", "Individual Loan", "Cash Sale", "Paygo Loan"]:
    sub = dd2.loc[dd2["loan_type"] == lt]
    fig.add_trace(go.Bar(name=lt, x=list(sub["status"]), y=list(sub["counts"])))
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()
Get the loan type with the most arrears. Group loans had the most arrears, at 1,297,484.
top_arrears_per_loan_type = dd.groupby(['loan_type'])['total_arrears'].sum().reset_index()
top_arrears_per_loan_type = top_arrears_per_loan_type.sort_values(by = ["total_arrears"],ascending=False)
top_arrears_per_loan_type
| | loan_type | total_arrears |
|---|---|---|
| 1 | Group Loan | 1297484.26 |
| 2 | Individual Loan | 579104.54 |
| 0 | Cash Sale | 0.00 |
| 3 | Paygo Loan | -864.00 |
The client with contract reference abc0856 had the highest arrears, at 16,556.
top_5_highest_borrowers = dd.groupby(['contract_reference'])['total_arrears'].sum().reset_index()
top_5_highest_borrowers = top_5_highest_borrowers.sort_values(by = ["total_arrears"],ascending=False)
top_5_highest_borrowers.head(5)
| | contract_reference | total_arrears |
|---|---|---|
| 855 | abc0856 | 16556.14 |
| 640 | abc0641 | 15769.96 |
| 469 | abc0470 | 14392.79 |
| 885 | abc0886 | 14106.22 |
| 387 | abc0388 | 13398.98 |
The mean of total arrears was about 1,876. Total arrears also have a peculiar distribution, with the first quartile at 0.
dd['total_arrears'].describe()
count     1000.000000
mean      1875.724800
std       3053.248041
min     -14396.000000
25%          0.000000
50%        796.450000
75%       3149.250000
max      16556.140000
Name: total_arrears, dtype: float64
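A zero first quartile means at least a quarter of contracts carry no arrears at all (or negative arrears, i.e. overpayments). A toy sketch of measuring that share on a zero-heavy series (values are made up):

```python
import pandas as pd

# Toy series with many zeros and one overpayment (hypothetical values).
arrears = pd.Series([0, 0, 0, 500, 1200, -300, 3000, 0])

# Share of contracts with no arrears (zero or overpaid).
share_nonpositive = (arrears <= 0).mean()
print(share_nonpositive)
```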
The region with l3_entity_id 5121 had the highest total arrears at 1,774,090. This represents roughly 95% of all total arrears.
top_5_highest_region = dd.groupby(['l3_entity_id'])['total_arrears'].sum().reset_index()
top_5_highest_region = top_5_highest_region.sort_values(by = ["total_arrears"],ascending=False)
top_5_highest_region = top_5_highest_region.head(5)
top_5_highest_region
| | l3_entity_id | total_arrears |
|---|---|---|
| 9 | 5121 | 1774090.16 |
| 7 | 5084 | 30259.73 |
| 2 | 5074 | 15314.20 |
| 3 | 5076 | 14317.78 |
| 1 | 5073 | 12732.97 |
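Each region's share of arrears follows from dividing the per-region sums by the overall total; a minimal sketch with toy numbers (not the real dataset):

```python
import pandas as pd

# Toy arrears-by-region frame (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "l3_entity_id": [5121, 5084, 5074],
    "total_arrears": [900.0, 60.0, 40.0],
})

# Each region's arrears as a fraction of the overall total.
share = df.set_index("l3_entity_id")["total_arrears"] / df["total_arrears"].sum()
print(share.round(2).to_dict())
```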
Kindly note the chart is interactive; you can download a PNG file of the chart to embed in a report.
#Draw plotly
fig =px.pie(top_5_highest_region, values='total_arrears', names='l3_entity_id',hole=.3)
fig.show()